
Add ontology support to dataset features endpoint #262

Merged
PGijsbers merged 6 commits into openml:main from hxrshxz:fix/237-ontology-support
Mar 16, 2026
Conversation


@hxrshxz hxrshxz commented Feb 28, 2026

Closes #237

Fetches ontology URIs from data_feature_description and includes them in GET /datasets/features/{dataset_id} responses, achieving feature parity with the PHP API.
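For illustration, the extended endpoint's response might carry the new field roughly as sketched below. The feature names, field layout, and URIs here are assumptions for the sketch, not taken from the actual API schema:

```python
# Hypothetical feature entries as they might appear in the JSON response of
# GET /datasets/features/{dataset_id}; names and URIs are illustrative only.
response = [
    {
        "index": 0,
        "name": "sepallength",
        "data_type": "numeric",
        "ontology": ["http://example.org/ontology/length"],  # newly surfaced field
    },
    {
        "index": 1,
        "name": "class",
        "data_type": "nominal",
        "ontology": None,  # features without ontology entries carry no URIs
    },
]

# Consumers can filter on the new field directly.
annotated = [feature["name"] for feature in response if feature["ontology"]]
```
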


coderabbitai bot commented Feb 28, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 2ba2d4e5-f926-4b5c-8d5f-45a59eb13815

📥 Commits

Reviewing files that changed from the base of the PR and between ab592f7 and 0d79669.

📒 Files selected for processing (2)
  • tests/routers/openml/datasets_test.py
  • tests/routers/openml/migration/datasets_migration_test.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • tests/routers/openml/datasets_test.py
  • tests/routers/openml/migration/datasets_migration_test.py

Walkthrough

This PR adds test coverage for ontology support in dataset features. A new test test_dataset_features_with_ontology validates that ontology data is correctly retrieved and assigned to dataset features, checking specific ontology URLs for certain features while ensuring features without ontology data do not include the field. The datasets migration test is updated to normalize ontology fields in-place rather than removing them entirely, mirroring the approach used for nominal values.

Possibly related PRs

  • Adds ontology support  #255: Implements the backend ontology support (database retrieval, schema field, and assignment to feature.ontology) that these test changes validate against.
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Docstring Coverage (⚠️ Warning): Docstring coverage is 12.50%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)

Title check (✅ Passed): The title clearly and concisely describes the main change: adding ontology support to the dataset features endpoint.
Description check (✅ Passed): The description directly relates to the changeset by referencing issue #237 and explaining the feature implementation for ontology support.
Linked Issues check (✅ Passed): The changes implement ontology support in the dataset features endpoint by fetching from data_feature_description and normalizing values, fulfilling the requirements in issue #237.
Out of Scope Changes check (✅ Passed): All changes are directly scoped to adding ontology support to dataset features, with no unrelated modifications detected.



@sourcery-ai sourcery-ai bot left a comment

Hey - I've found 1 issue, and left some high level feedback:

  • In get_feature_ontologies, consider using rows.mappings() and accessing columns via row["index"] / row["value"] (or row._mapping[...]) instead of row.index and row.value to avoid potential conflicts with Row.index and make the column access more explicit and robust.

### Comment 1
<location path="src/database/datasets.py" line_range="134-143" />
<code_context>
     return [Feature(**row, nominal_values=None) for row in rows.mappings()]


+def get_feature_ontologies(dataset_id: int, connection: Connection) -> dict[int, list[str]]:
+    """Return a mapping from feature index to its list of ontology URIs."""
+    rows = connection.execute(
+        text(
+            """
+            SELECT `index`, `value`
+            FROM data_feature_description
+            WHERE `did` = :dataset_id AND `description_type` = 'ontology'
+            """,
+        ),
+        parameters={"dataset_id": dataset_id},
+    )
+    ontologies: dict[int, list[str]] = {}
+    for row in rows:
+        ontologies.setdefault(row.index, []).append(row.value)
+    return ontologies
</code_context>
<issue_to_address>
**issue (bug_risk):** Accessing `row.index`/`row.value` directly on SQLAlchemy `Row` can be fragile; prefer `.mappings()` or `_mapping[...]` for robustness.

In this loop you’re iterating raw `Row` objects and relying on `row.index` / `row.value`, which can collide with `Row.index()` and other internals, and may change across SQLAlchemy versions. Use mapping access instead, e.g. iterate `rows.mappings()` and use `row["index"]` / `row["value"]`, or use `row._mapping["index"]` / `row._mapping["value"]` on the existing result to avoid these name clashes.
</issue_to_address>
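The mapping-style access Sourcery recommends can be illustrated with the standard library's `sqlite3` as a stand-in for SQLAlchemy's `rows.mappings()`. The table and column names mirror the snippet above, but the connection setup and the inserted data are made up for the sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows behave like mappings, akin to rows.mappings()
conn.execute(
    "CREATE TABLE data_feature_description"
    " (did INTEGER, `index` INTEGER, value TEXT, description_type TEXT)"
)
conn.executemany(
    "INSERT INTO data_feature_description VALUES (?, ?, ?, ?)",
    [
        (1, 2, "http://example.org/ontology/a", "ontology"),
        (1, 2, "http://example.org/ontology/b", "ontology"),
        (1, 5, "http://example.org/ontology/c", "ontology"),
        (1, 5, "free-text description", "description"),  # filtered out by the WHERE clause
    ],
)

rows = conn.execute(
    "SELECT `index`, value FROM data_feature_description"
    " WHERE did = ? AND description_type = 'ontology'"
    " ORDER BY `index`, value",  # deterministic ordering
    (1,),
)
ontologies: dict[int, list[str]] = {}
for row in rows:
    # row["index"] cannot collide with a method name, unlike row.index
    ontologies.setdefault(row["index"], []).append(row["value"])
```

The same subscript access works on SQLAlchemy mapping rows, which is what makes it robust against `Row.index` shadowing.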

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (2)
src/routers/openml/datasets.py (1)

296-298: Optional: defer ontology query until after empty-feature guard.

Line 296 currently performs an extra read even when no features exist and the endpoint will raise immediately.

⚡ Small query-order optimization
-    ontologies = database.datasets.get_feature_ontologies(dataset_id, expdb)
-    for feature in features:
-        feature.ontology = ontologies.get(feature.index)
-
     if not features:
         processing_state = database.datasets.get_latest_processing_update(dataset_id, expdb)
         ...
         raise HTTPException(...)

+    ontologies = database.datasets.get_feature_ontologies(dataset_id, expdb)
+    for feature in features:
+        feature.ontology = ontologies.get(feature.index)
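The reordering can be sketched in isolation. Everything below (function names, the stand-in exception, the injected fetcher) is hypothetical and only mirrors the structure of the suggestion:

```python
def features_with_ontologies(features: list[dict], fetch_ontologies) -> list[dict]:
    """Attach ontology URIs to features, guarding before the extra read."""
    if not features:
        # Stand-in for the endpoint's HTTPException path.
        raise LookupError("dataset has no features")
    ontologies = fetch_ontologies()  # only reached when there is work to do
    for feature in features:
        feature["ontology"] = ontologies.get(feature["index"])
    return features


calls = []

def fake_fetch():
    calls.append(1)
    return {0: ["http://example.org/onto/x"]}

result = features_with_ontologies([{"index": 0}, {"index": 1}], fake_fetch)
```

With the guard first, the empty-features path raises without ever invoking the fetcher, which is the extra read the review comment flags.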
tests/routers/openml/migration/datasets_migration_test.py (1)

225-228: Consider deduplicating legacy list-to-scalar normalization.

Line 224 and Line 228 apply the same conversion pattern; extracting a tiny helper would reduce drift in future parity tweaks.

♻️ Minimal refactor
+    def _legacy_scalar_or_list(values: list[str]) -> str | list[str]:
+        return values if len(values) > 1 else values[0]
+
     for feature in python_body:
         for key, value in list(feature.items()):
             if key == "nominal_values":
                 values = feature.pop(key)
-                feature["nominal_value"] = values if len(values) > 1 else values[0]
+                feature["nominal_value"] = _legacy_scalar_or_list(values)
             elif key == "ontology":
                 values = feature.pop(key)
-                feature["ontology"] = values if len(values) > 1 else values[0]
+                feature["ontology"] = _legacy_scalar_or_list(values)
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/database/datasets.py`:
- Around line 139-142: The SQL selecting ontology entries from
data_feature_description (WHERE `did` = :dataset_id AND `description_type` =
'ontology') lacks an ORDER BY, causing non-deterministic array ordering; update
that query (and the similar occurrence that also selects ontology rows) to
append "ORDER BY `index` ASC" (or ORDER BY `index`) so results are consistently
ordered by the `index` column—apply this change to both places that query
data_feature_description for description_type='ontology'.
- Around line 147-148: The loop uses positional attribute access on SQLAlchemy
Row (for row in rows: ontologies.setdefault(row.index, []).append(row.value)),
but row.index resolves to the Row.index() method in SQLAlchemy 2.x; change to
mapping-style access (e.g., row['index'] and row['value'] or
row._mapping['index'] / row._mapping['value']) so the dictionary key is the
actual column value; update the code that iterates over rows and uses
ontologies.setdefault to use these mapping keys instead.

---

Nitpick comments:
In `@src/routers/openml/datasets.py`:
- Around line 296-298: The code calls
database.datasets.get_feature_ontologies(dataset_id, expdb) before checking if
there are any features, causing an unnecessary DB read; change the order so you
first check the features collection (e.g., if not features: return or raise as
the endpoint expects) and only call get_feature_ontologies(dataset_id, expdb)
when features is non-empty, then iterate and set feature.ontology =
ontologies.get(feature.index); this uses the existing symbols features,
dataset_id, expdb, and database.datasets.get_feature_ontologies.

In `@tests/routers/openml/migration/datasets_migration_test.py`:
- Around line 225-228: Extract the repeated "list-to-scalar" normalization into
a small helper (e.g., normalize_legacy_list(value) or normalize_list_to_scalar)
and replace both in-place patterns that operate on feature["ontology"] (the
block that does values = feature.pop(key); feature["ontology"] = values if
len(values) > 1 else values[0]) with a call to that helper so the code becomes
feature["ontology"] = normalize_legacy_list(feature.pop(key)); ensure the helper
accepts the popped value (either list or str) and returns the list when length >
1 or the single element otherwise.

ℹ️ Review info

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 72989df and 5280c6a.

📒 Files selected for processing (5)
  • src/database/datasets.py
  • src/routers/openml/datasets.py
  • src/schemas/datasets/openml.py
  • tests/routers/openml/datasets_test.py
  • tests/routers/openml/migration/datasets_migration_test.py

@hxrshxz hxrshxz force-pushed the fix/237-ontology-support branch from ab592f7 to d132eb2 on February 28, 2026 17:58
@PGijsbers PGijsbers mentioned this pull request Mar 16, 2026
Copilot AI review requested due to automatic review settings on March 16, 2026 13:42
Copilot AI left a comment

Copilot was unable to review this pull request because the user who requested the review is ineligible. To be eligible to request a review, you need a paid Copilot license, or your organization must enable Copilot code review.

codecov bot commented Mar 16, 2026

Codecov Report

❌ Patch coverage is 25.00000% with 3 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (main@c617d6d). Learn more about missing BASE report.

Files with missing lines: src/routers/openml/datasets.py (patch coverage 0.00%, 3 lines missing ⚠️)
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #262   +/-   ##
=======================================
  Coverage        ?   54.44%           
=======================================
  Files           ?       34           
  Lines           ?     1462           
  Branches        ?      119           
=======================================
  Hits            ?      796           
  Misses          ?      664           
  Partials        ?        2           


@PGijsbers PGijsbers marked this pull request as draft March 16, 2026 13:49
@PGijsbers PGijsbers marked this pull request as ready for review March 16, 2026 13:49
@sourcery-ai sourcery-ai bot left a comment

Hey - I've left some high level feedback:

  • The new ontology test relies on magic dataset IDs and feature indices (e.g. dataset 11, features 1–3); consider using fixtures or named constants to make the intent clearer and reduce coupling to specific seeded data.
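One way to follow this suggestion is to hoist the magic values into named module-level constants that the test reads from. All values below are placeholders, not the real seeded dataset IDs or URIs:

```python
# Placeholder constants; the real test would name the actual seeded dataset
# and its annotated features instead of these invented values.
ONTOLOGY_DATASET_ID = 11
EXPECTED_ONTOLOGIES = {
    1: ["http://example.org/onto/first"],
    2: ["http://example.org/onto/second"],
}
INDICES_WITHOUT_ONTOLOGY = {0, 3}


def assert_ontology_expectations(features: list[dict]) -> None:
    """Check each feature against the named expectations above."""
    for feature in features:
        index = feature["index"]
        if index in EXPECTED_ONTOLOGIES:
            assert feature["ontology"] == EXPECTED_ONTOLOGIES[index]
        elif index in INDICES_WITHOUT_ONTOLOGY:
            assert feature.get("ontology") is None


# Exercise the checker on a fabricated payload.
fake_features = [
    {"index": 0},
    {"index": 1, "ontology": ["http://example.org/onto/first"]},
    {"index": 2, "ontology": ["http://example.org/onto/second"]},
    {"index": 3},
]
assert_ontology_expectations(fake_features)
```

Named constants make the coupling to seeded data explicit in one place, so a reseed only requires updating the constants rather than hunting through assertions.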


@PGijsbers

Trying to trigger CI on the latest commit.

@PGijsbers PGijsbers closed this Mar 16, 2026
@PGijsbers PGijsbers reopened this Mar 16, 2026
@PGijsbers PGijsbers merged commit c9ada15 into openml:main Mar 16, 2026
3 checks passed

Labels: None yet
Projects: None yet
Development: Successfully merging this pull request may close these issues: Add ontology support
3 participants